# XLA Architecture



## Why did we build XLA?

We had several objectives for XLA to work with TensorFlow:

- Improve execution speed. Compile subgraphs to reduce the execution time of short-lived
  Ops by eliminating overhead from the TensorFlow runtime, fuse pipelined operations to
  reduce memory overhead, and specialize to known tensor shapes to allow for more
  aggressive constant propagation.
- Improve memory usage. Analyze and schedule memory usage, in principle eliminating
  many intermediate storage buffers.
- Reduce reliance on custom Ops. Remove the need for many custom Ops by improving the
  performance of automatically fused low-level Ops to match the performance of custom
  Ops that were fused by hand.
- Reduce mobile footprint. Eliminate the TensorFlow runtime by ahead-of-time compiling
  the subgraph and emitting an object/header file pair that can be linked directly into
  another application. The results can reduce the footprint for mobile inference by several
  orders of magnitude.
- Improve portability. Make it relatively easy to write a new backend for novel hardware,
  at which point a large fraction of TensorFlow programs will run unmodified on that
  hardware. This is in contrast with the approach of specializing individual monolithic
  Ops for new hardware, which requires TensorFlow programs to be rewritten to make use
  of those Ops.
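The speed and memory objectives above both lean on fusion. As a rough illustration only, the following pure-Python sketch (not XLA code; the function names `unfused` and `fused` are hypothetical) contrasts running three elementwise operations one at a time, each writing a full intermediate buffer, with a single fused loop that writes only the output.

```python
def unfused(xs, ys):
    """Three separate 'ops', each materializing a full buffer."""
    buffers_allocated = 0
    prod = [x * y for x, y in zip(xs, ys)]   # intermediate buffer
    buffers_allocated += 1
    shifted = [p + 1.0 for p in prod]        # intermediate buffer
    buffers_allocated += 1
    out = [max(s, 0.0) for s in shifted]     # output buffer
    buffers_allocated += 1
    return out, buffers_allocated


def fused(xs, ys):
    """The same computation as one loop: only the output buffer is written."""
    out = [max(x * y + 1.0, 0.0) for x, y in zip(xs, ys)]
    return out, 1


xs = [1.0, -2.0, 3.0]
ys = [2.0, 2.0, -1.0]
unfused_out, unfused_buffers = unfused(xs, ys)
fused_out, fused_buffers = fused(xs, ys)
```

Both versions compute the same values, but the fused one never stores the two intermediate results. A compiler that knows the whole subgraph can make this transformation automatically, which is the motivation for reducing reliance on hand-fused custom Ops.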

## How does XLA work?

The input language to XLA is called "HLO IR", or just HLO (High Level Operations). The semantics of HLO are described on the [Operation Semantics](https://www.tensorflow.org/xla/operation_semantics) page. It is most convenient to think of HLO as a [compiler IR](https://en.wikipedia.org/wiki/Intermediate_representation).

XLA takes graphs ("computations") defined in HLO and compiles them into machine instructions for various architectures. XLA is modular in the sense that it is easy to slot in an alternative backend to [target some novel hardware architecture](https://www.tensorflow.org/xla/developing_new_backend). The CPU backend for x64 and ARM64 as well as the NVIDIA GPU backend are in the TensorFlow source tree.

The following diagram shows the compilation process in XLA:



XLA comes with several optimization and analysis passes that are target-independent, such as [CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination), target-independent operation fusion, and buffer analysis for allocating runtime memory for the computation.
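To make the idea of a target-independent pass concrete, here is a minimal sketch of common subexpression elimination over a toy instruction list. This is not XLA's implementation: the `(name, op, operands)` tuple format and the function name are assumptions made for illustration, and the sketch ignores details such as commutativity and side effects.

```python
def eliminate_common_subexpressions(instructions):
    """Deduplicate a toy IR given as (name, op, operand_names) in definition order.

    Returns the surviving instructions and a map from removed names to the
    names that replace them.
    """
    seen = {}      # (op, operands) -> name of the first instruction computing it
    replaced = {}  # removed name -> surviving name
    kept = []
    for name, op, operands in instructions:
        # Rewrite operands first so that chains of duplicates collapse too.
        operands = tuple(replaced.get(o, o) for o in operands)
        key = (op, operands)
        # Parameters are distinct inputs and must never be merged.
        if op != "parameter" and key in seen:
            replaced[name] = seen[key]
        else:
            seen.setdefault(key, name)
            kept.append((name, op, operands))
    return kept, replaced


program = [
    ("x", "parameter", ()),
    ("y", "parameter", ()),
    ("a", "add", ("x", "y")),
    ("b", "add", ("x", "y")),        # same expression as a
    ("c", "multiply", ("a", "b")),
    ("d", "multiply", ("a", "a")),   # same as c once b is replaced by a
]
kept, replaced = eliminate_common_subexpressions(program)
```

Nothing in this pass depends on the target hardware, which is why passes of this kind can run before the computation is handed to a backend.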

After the target-independent step, XLA sends the HLO computation to a backend. The backend can perform further HLO-level optimizations, this time with target-specific information and needs in mind. For example, the XLA GPU backend may perform operation fusion that is beneficial specifically for the GPU programming model and determine how to partition the computation into streams. At this stage, backends may also pattern-match certain operations or combinations thereof to optimized library calls.
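The pattern-matching step can be sketched in the same toy style. The example below rewrites `add(dot(a, b), c)` into a single call to a library kernel. The `(name, op, operands)` format, the function name, and the `library_call:gemm_bias` op are all hypothetical and chosen only for illustration; a real backend matches many more patterns and checks shapes, layouts, and types before rewriting.

```python
def rewrite_dot_add(instructions):
    """Replace add(dot(a, b), c) with one call to a hypothetical library kernel."""
    defs = {name: (op, operands) for name, op, operands in instructions}

    # Count users so that a dot is only absorbed when nothing else reads it.
    uses = {}
    for _, _, operands in instructions:
        for operand in operands:
            uses[operand] = uses.get(operand, 0) + 1

    absorbed = set()
    rewritten = []
    for name, op, operands in instructions:
        if op == "add" and len(operands) == 2:
            lhs, rhs = operands
            lhs_op, lhs_operands = defs[lhs]
            if lhs_op == "dot" and uses[lhs] == 1:
                absorbed.add(lhs)
                rewritten.append(
                    (name, "library_call:gemm_bias", lhs_operands + (rhs,))
                )
                continue
        rewritten.append((name, op, operands))
    # Drop the dot instructions that were folded into a library call.
    return [inst for inst in rewritten if inst[0] not in absorbed]


program = [
    ("a", "parameter", ()),
    ("b", "parameter", ()),
    ("bias", "parameter", ()),
    ("d", "dot", ("a", "b")),
    ("out", "add", ("d", "bias")),
]
result = rewrite_dot_add(program)
```

For brevity the sketch only matches a `dot` in the left operand of the `add`. The single-user check matters: if another instruction also read `d`, removing the `dot` would leave that instruction without its input.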

The next step is target-specific code generation. The CPU and GPU backends included with XLA use [LLVM](http://llvm.org) for low-level IR, optimization, and code generation. These backends emit the LLVM IR necessary to represent the XLA HLO computation in an efficient manner, and then invoke LLVM to emit native code from this LLVM IR.
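As a very rough picture of what "emitting LLVM IR" means, the sketch below lowers a scalar toy computation to textual LLVM IR. It is an illustration under strong assumptions, not how XLA's emitters are written: the tuple format and `emit_llvm_ir` are hypothetical, only three binary float ops are handled, and real HLO operates on arrays, so actual backends emit loops or GPU kernels rather than straight-line scalar code.

```python
# Map toy op names to LLVM floating-point instructions.
_LLVM_OPCODES = {"add": "fadd", "multiply": "fmul", "subtract": "fsub"}


def emit_llvm_ir(function_name, instructions):
    """Lower a scalar float computation to textual LLVM IR.

    instructions is a list of (name, op, operand_names) in definition order;
    the last instruction is treated as the result.
    """
    params = [name for name, op, _ in instructions if op == "parameter"]
    signature = ", ".join(f"float %{p}" for p in params)
    lines = [f"define float @{function_name}({signature}) {{", "entry:"]
    for name, op, operands in instructions:
        if op == "parameter":
            continue
        lhs, rhs = operands
        lines.append(f"  %{name} = {_LLVM_OPCODES[op]} float %{lhs}, %{rhs}")
    root = instructions[-1][0]
    lines.append(f"  ret float %{root}")
    lines.append("}")
    return "\n".join(lines)


program = [
    ("x", "parameter", ()),
    ("y", "parameter", ()),
    ("t", "multiply", ("x", "y")),
    ("r", "add", ("t", "x")),
]
ir_text = emit_llvm_ir("toy_computation", program)
```

The resulting text is the kind of input that LLVM's optimizer and code generator consume. Delegating that final stage to LLVM is what lets one backend serve multiple instruction sets.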

The GPU backend currently supports NVIDIA GPUs via the LLVM NVPTX backend; the CPU backend supports multiple CPU ISAs.


Last updated 2021-01-28 UTC.